A brain organoid, or cerebral organoid, is a self-organizing tree-dimensional tissue that can mimic the architecture and functionality of the human brain, which is derived from human embryonic stem cells or pluripotent stem cells [1]. During embryonic development, various cell types are generated for the human brain. Theoretically, studying human brain organoids holds unprecedented promise for understanding the brain development and regulation [2].
Cellular diversity develops in a specific temporal sequence in the cerebral cortex [1]. After one-month culturing, cerebral organoids expanded and showed early brain regionalization, including the forebrain, choroid plexus, hippocampaus and retinal zones [3]. Individual cells can be isolated from the brain organoids at 3 and 6 months with single-cell mRNA sequencing. Some cell types can be found in both 3- and 6-month organoids, but some only presented at 6 months, e.g. bipolar cells. This indicated that long-term culture allows brain organoids to develop continuously, increasing cellular diversity and promoting neuronal maturation [1]. In this project, we will focus on the differential regulation in cell types that presented both in 3- and 6-month organoids.
RNA velocity, a high-dimensional vector, can predict the future state of individual cells on a timescale of hours [4]. It can be determined by the abundance of spliced and unspliced mRNA. Positive RNA velocity refers to an increase in unspliced transcripts, indicating up-regulation in spliced transcripts. Conversely, down-regulation is indicated by the negative velocity [4]. Therefore, RNA velocity is expected to greatly aid the analysis of development lineages and regulations.
This project aimed to explore the differential regulation between 3-month and 6-month organoids from scRNA-seq data using three methods (eisaR, BRIE2 and DEXseq_USA) and compare their results. The comparison of these methods might suggest which kind of differential analysis is most useful in identifying genes related to brain functions and in providing extra information. We first performed UMAP clustering of cells and the RNA velocity to determine whether differences can be observed between 3- and 6-month organoids. Then we used three methods (eisaR, BRIE2 and DEXSeq_USA) to investigate the differences in the relative abundance of spliced and unspliced reads between conditions in cell types that presented in both 3- and 6-month groups. Finally, we compared the results generated by the three methods.
The brain organoids single cell RNA-sequencing data [2] (scRNA-seq) data have been deposited at Gene Expression Omnibus with accession number GSE129519 and downloaded from Single Cell portal (https://singlecell.broadinstitute.org/single_cell/study/SCP282/reproducible-brain-organoids). This dataset (Table 1) contains scRNA-seq data from 78,379 cells from nine individual organoids from two stem cell lines, PGP1 (two independent batches, b1 and b2) and HUES66 (one batch), at three months of growth. As astroglial cells are typically produced later in development, the dataset also contains scRNA-sequencing of an additional 87,863 cells from 12 individual organoids (PGP1: two independent batches, b1 and b3; GM08330 and 11a: one batch each). To explore the differential regulation over time, we compared the cell lines from organoids from the same differentiation batch (PGP1 b1) but collected at three and six months.
Table 1. Statistics from scRNA-seq data
In this section, we first divided the organoids into two groups (3- and 6-month group). Then we performed quality control on cells and filtered lowly abundant genes to remove undetected genes. Finally, we assigned cell types based on the original study and computed normalized data. The pre-processed data was visualized by the UMAPs that are colored by group (3- and 6-month), biological replicate (organoids) and cell types. Cell types with at least 50 cells per group were selected for differential analyses.
# load packages:
library(Matrix)
library(SingleCellExperiment)
library(scater)
#library(anndata)
library(UpSetR)
library('rjson')
library(ggplot2)
library(reshape2)
#library(knitr)
#library(png)
# set working directory:
setwd('~/Desktop/STA_426_UZH/Project/')
The original dataset was quantified through alevin-fry and thus being separated into spliced and unspliced reads. A function was defined to store the data in a single-cell experiment, and to determine if the group belongs to 3- or 6-month. As shown in Table 1, organoid 1-3 belong to 3-month development (group 1) and organoid 16-18 belong to 6-month development (group 2).
# Define function to load abundances: spliced and unspliced reads, and cell and gene names:
# store data in a single-cell experiment, and specify the group if (3 vs. 6 months)
import_abundances <- function (sample_id) {
path <- paste0('~/Desktop/STA_426_UZH/Project/Preprocess/',sample_id, '/')
spliced_counts <- readMM(paste0(path,
"spliced_quants_mat.mtx"))
unspliced_counts <- readMM(paste0(path,
"unspliced_quants_mat.mtx"))
genes <- read.delim(paste0(path, "quants_mat_cols.txt"), header = FALSE,
as.is = TRUE)[, 1]
cells <- read.delim(paste0(path, "quants_mat_rows.txt"), header = FALSE,
as.is = TRUE)[, 1]
# we define "counts" as spliced counts (mature mRNA)
temp <- SingleCellExperiment(list(counts = t(spliced_counts),
unspliced = t(unspliced_counts)))
temp$sample_id <- sample_id
rownames(temp) <- genes
# add sample_id to cell id, so that barcodes from different samples have different ids.
colnames(temp) <- paste0(sample_id, ".", cells)
temp$group <- ifelse(temp$sample_id %in% paste0("organoid", c(1:3)) , "3_mon", "6_mon")
return (temp)
}
# specify sample names:
# organoids 1:3 belong to group 1 (3 months of development)
# organoids 16:18 belong to group 2 (6 months of development)
samples <- paste0("organoid", c(1:3, 16:18) )
# use the previous function to load the samples:
sce <- lapply(samples[[1]], FUN = function (samp) {
import_abundances(sample_id = samp)
})
listOfDFs <- list()
for(i in 1:6){
sces <- lapply(samples[[i]], FUN = function (samp) {
import_abundances(sample_id = samp)
})
listOfDFs <- append(listOfDFs,sces)
}
## Warning in scan(file, nmax = nz, quiet = TRUE, what = list(i = integer(), :
## embedded nul(s) found in input
# merge samples into 1 sce:
sce <- do.call("cbind", listOfDFs)
# store sample_id (organoid id) as factor
colData(sce)$sample_id <- factor(colData(sce)$sample_id)
To filter the data to only include true cells that are of high quality, we performed quality control on cells. In addition, we also filtered lowly abundant genes to remove undetected genes (less than 10 non-zero cells).
# calculate per-cell quality control (QC) metrics
qc <- perCellQCMetrics(sce)
# remove cells with few or many detected genes
ol <- isOutlier(metric = qc$detected, nmads = 2, log = TRUE)
sce <- sce[, !ol]
# %of cells removed:
100 * mean(ol)
## [1] 3.388218
rm(ol)
# Filter lowly abundant genes
filter <- rowSums(assays(sce)$counts) >= 10
sce <- sce[filter, ]
# assign cell types based on the original study:
md <- read.csv("~/Desktop/STA_426_UZH/Project/Preprocess/meta_combined.txt", sep = "\t")
md <- md[grepl("PGP1", md$Batch), ]
md$SEQ <- sapply(md$NAME, FUN = function (x) strsplit(x, split = "_")[[1]][3])
md$ID <- paste0('organoid',md$Organoid, ".", md$SEQ)
md <- md[, c("ID", "CellType")]
matches = match(colnames(sce), md$ID)
head(colnames(sce)); head(md$ID[matches])
## [1] "organoid1.CGCCAAGCAATGCCAT" "organoid1.GTCACAAGTGCTTCTC"
## [3] "organoid1.ATTGGTGTCACCTTAT" "organoid1.GTTCTCGAGCGCCTCA"
## [5] "organoid1.TCTTCGGGTCCAGTTA" "organoid1.AACTCCCGTTAAAGTG"
## [1] "organoid1.CGCCAAGCAATGCCAT" "organoid1.GTCACAAGTGCTTCTC"
## [3] "organoid1.ATTGGTGTCACCTTAT" "organoid1.GTTCTCGAGCGCCTCA"
## [5] "organoid1.TCTTCGGGTCCAGTTA" "organoid1.AACTCCCGTTAAAGTG"
sce$cell_type <- md$CellType[matches]
After pre-processing the data, we computed the normalized data via logNormCounts and clustered cells from organoids derived from PGP1 b1 batch collected at three and six months. The UMAP colored by time (Figure 1) shows distinct separation between 3- and 6-month organoid cells. In addition, the large overlap across samples belonging to the same sample in the UMAP colored by sample id (Figure 2) indicated that we have successfully assigned the organoid samples to its corresponding group (3-month group: organoid 1-3; 6-month group: organoid 16-18). The UMAP colored by cell types quantified the differences in cellular composition over time (Figure 3). Many cell types only present in one group. For example, the 3-month cultures contained substantial numbers of astrocytes that are not contained by the 6-month cultures.
# compute sum-factors, normalize data and compute UMAP:
sce <- computeLibraryFactors(sce)
sce <- logNormCounts(sce)
sce <- runUMAP(sce)
# plot UMAPs, coloured by group, sample_id and cell_type:
# clear separation of cells by group/time:
plotUMAP(sce, colour_by = "group")
Figure 1. UMAP colored by time condition shows distinct separation between 3- and 6-month organoids.
# large overlap across samples belonging to the same sample:
plotUMAP(sce, colour_by = "sample_id")
Figure 2. UMAP colored by sample ids shows large overlap across samples belonging to the same organoid.
# clear clustering of cells types, within each group:
plotUMAP(sce, colour_by = "cell_type")
Figure 3. UMAP colored by cell types shows clear clustering of cell types within each group
Finally, we prepared the data for differential analyses by keeping the cell types that are present in both groups: select cell types with at least 50 cells per group. 6 cell types were selected: CPNs, Cycling, Immature CPNs, Immature PNs, oRG and RG.
# count how many there are before filtering them out (if too many smt wrong):
sel_NAs = is.na(sce$cell_type)
sum( sel_NAs ); mean( sel_NAs)
## [1] 7300
## [1] 0.1597095
# 15 % NA assignment + some "Unknown".
# many cell types only present in 1 group only!
tab = table(sce$cell_type, sce$group); tab
##
## 3_mon 6_mon
## Astroglia 0 3435
## CFuPNs 2804 0
## CPNs 4769 1455
## Cycling 638 1713
## Immature CFuPNs 1173 0
## Immature CPNs 3293 2679
## Immature Interneurons 0 2260
## Immature PNs 1563 1358
## IPCs 0 170
## IPCs/Immature PNs 1395 0
## oRG 1017 5719
## RG 658 417
## Unknown 0 569
## Ventral Precursors 0 1323
table(sce$cell_type, sce$sample_id)
##
## organoid1 organoid16 organoid17 organoid18 organoid2
## Astroglia 0 1225 1000 1210 0
## CFuPNs 1311 0 0 0 971
## CPNs 1508 528 416 511 1992
## Cycling 222 756 499 458 239
## Immature CFuPNs 394 0 0 0 463
## Immature CPNs 920 1033 738 908 1352
## Immature Interneurons 0 922 768 570 0
## Immature PNs 596 570 321 467 604
## IPCs 0 71 56 43 0
## IPCs/Immature PNs 466 0 0 0 486
## oRG 313 2322 1690 1707 418
## RG 156 199 112 106 312
## Unknown 0 191 242 136 0
## Ventral Precursors 0 526 462 335 0
##
## organoid3
## Astroglia 0
## CFuPNs 522
## CPNs 1269
## Cycling 177
## Immature CFuPNs 316
## Immature CPNs 1021
## Immature Interneurons 0
## Immature PNs 363
## IPCs 0
## IPCs/Immature PNs 443
## oRG 286
## RG 190
## Unknown 0
## Ventral Precursors 0
# select cell types with at least 50 cells per group:
sel_cell_types = rownames(tab)[ rowSums(tab > 50) == 2 ];
sel_cell_types
## [1] "CPNs" "Cycling" "Immature CPNs" "Immature PNs"
## [5] "oRG" "RG"
sce_filt = sce[, sce$cell_type %in% sel_cell_types ]
# we have used this object to compare groups, in each cell-type, via edgeR, BRIE2 and DEXSeq_USA differential approaches
sce_filt
## class: SingleCellExperiment
## dim: 23486 25279
## metadata(0):
## assays(3): counts unspliced logcounts
## rownames(23486): ENSG00000233750.3 ENSG00000225972.1 ...
## ENSG00000276345.1 ENSG00000278817.1
## rowData names(0):
## colnames(25279): organoid1.CGCCAAGCAATGCCAT organoid1.GTCACAAGTGCTTCTC
## ... organoid18.ACCAGTATCAACGGCC organoid18.ATAACGCGTTTAGCTG
## colData names(4): sample_id group cell_type sizeFactor
## reducedDimNames(1): UMAP
## mainExpName: NULL
## altExpNames(0):
We used scVelo, generated by [5], to generalize RNA velocity of the human brain organoid cells collected at 3 and 6 months. In details, scVelo infers gene-specific rates of transcription, splicing and degradation, and recovers the latent time of the underlying cellular processes.
scVelo pipeline
First, we used the SingleCellExperiment data to generate the loom files for scVelo.
#library(anndata)
samples <- paste0("organoid", c(1:3, 16:18) )
for (sample in samples) {
# select individual sample:
xx<- sce[, sce$sample_id == sample]
# create AnnData:
ad <- AnnData(
X = t(assay(xx, "counts")),
layers = list(
# counts object in sce contain spliced counts:
spliced = t(assay(xx, "counts")),
unspliced = t(assay(xx, "unspliced"))
),
obsm = list(
# store UMAP:
X_umap = reducedDim(xx, "UMAP")
),
# store cell-type:
obs = data.frame(clusters = xx$cell_type)
)
write_loom(ad, paste0("adata_", sample, ".loom"),
write_obsm_varm = TRUE)
}
This process was executed by Python 3.7.2.
import scanpy as sc
import numpy as np
import anndata as ad
import scvelo as scv
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import os
import gc
gc.collect()
matplotlib.use("AGG")
scv.settings.set_figure_params("scvelo")
scv.settings.verbosity = 3
scv.settings.presenter_view = True
scv.logging.print_version()
plot_path = "./figure/"
# please change the loop range to (16,19) for 6-month group
for i in range(1,3):
path = "adata_organoid" + str(i) + ".loom"
adata = ad.read_loom(path)
adata.obs = adata.obs.astype("category")
print(adata, adata.obs)
temp_path = plot_path + "organoid" + str(i) + "/"
os.mkdir(temp_path)
#########################################################
# Basic preprocessing:
#########################################################
scv.pp.filter_and_normalize(adata, enforce = True)
scv.pp.moments(adata)
#########################################################
# Plot proportions of spliced and unspliced reads:
#########################################################
scv.pl.proportions(adata)
plt.savefig(temp_path + "proportions.png", dpi = 300, bbox_inches = "tight")
#########################################################
# Velocity tools:
#########################################################
scv.tl.recover_dynamics(adata)
scv.tl.velocity(adata, mode = "dynamical")
scv.tl.velocity_graph(adata)
scv.pl.velocity_embedding_stream(adata, basis = "umap")
plt.savefig(temp_path + "dynamical_model.png", dpi = 300, bbox_inches = "tight")
#########################################################
# Velocity pseudotime:
#########################################################
scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot')
plt.savefig(temp_path + 'velocity_pseudo_time_graph.png', dpi=300, bbox_inches='tight')
#########################################################
# latent-time graph:
#########################################################
scv.tl.latent_time(adata)
scv.pl.scatter(adata, color='latent_time', color_map='gnuplot', size=80)
plt.savefig(temp_path + 'velocity_latent_time_graph.png', dpi=300, bbox_inches='tight')
#########################################################
# PAGA:
#########################################################
adata.uns['neighbors']['distances'] = adata.obsp['distances']
adata.uns['neighbors']['connectivities'] = adata.obsp['connectivities']
scv.tl.paga(adata, groups='clusters')
df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T
df.style.background_gradient(cmap='Blues').format('{:.2g}')
scv.pl.paga(adata, basis='umap', size=50, alpha=.1,
min_edge_width=2, node_size_scale=1.5)
plt.savefig(temp_path + 'PAGA.png', dpi=300, bbox_inches='tight')
The RNA velocities (PAGA: Figure 4 and latent time: Figure 5) overall show similar trends between biological replicates in the same condition. In addition, the dissimilarity greatly increases between conditions. More specifically, we observed the similar topology and cell types if organoids were collected at the same time point. The similar temporal sequence of the transcriptomic events and cellular fates are also observed in the latent time graph of the same condition. However, the topology and trends show a large difference between groups.
For other outputs of scVelo (including dynamic model, proportions of cells, velocity pseudo time), please look at scVelo output.
scVelo result
PAGA, a trajectory inference method, firstly suggests inferred directed abstracted representations of trajectories by integrating RNA velocity to better model cell fate decisions [5].
Figure 4. The PAGA velocity for each organoid in 3- and 6-month group
The cell’s internal clock, which accurately characterizes the cell’s location in the underlying biological process, is represented by the inferred latent time. The latent time is associated with transcriptional dynamics and considers speed and direction of motion [5].
Figure 5. The velocity latent time for each organoid in 3- and 6-month group
The distinct separation in UMAP and significant difference between 3- and 6-month organoids implied differential regulation over time. Here we analysed the differeial usage (differences in the relative abundance) of spliced and unspliced reads between conditions using eisaR, BRIE2 and DEXSeq.
Briefly, eisaR and DEXSeq perform pseudo-bulk differential analyses: in each cell type, they aggregate cell counts (summation). eisaR and BRIE2 perform differential analyses on estimated spliced and unspliced reads.
(Please note that the differential analyses was done by Dr. Simone Tiberi and so no code.)
eisaR measures changes in pre-mRNA and mature RNA reads across different experimental conditions by analysing ordinary RNA-seq data and thus quantifying transcriptional and post-transcriptional regulation.
BRIE2 is a scalable Bayesian method for accurate identification of splicing phenotypes in scRNA-seq experiments and supports the analysis of splicing process between spliced and unspliced RNAs.
DEXSeq focuses on finding differential exon usage using RNA-seq exon counts between samples with different experimental designs. It allows users to run statistical tests using a model that uses the negative binomial distribution to estimate variation between biological replicates and generalized linear models for testing.
We are interested how the significant genes in each cell type differ between the three methods. As the large difference in the number of the significant results, we only consider top 100 significant genes of each method in each cell type. As mentioned, we compared cell types that presented in both 3- and 6-month organoids and so there are 600 genes (100 genes for CPNs, Cycling, Immature CPNs, Immature PNs, oRG and RG) for each method.
Here, we first used UpSet plot (Figure 6) to provide an efficient way to visualize intersections of the significant genes (FDR < 0.05) in each method (UpSet). As the organoid cells collected at different times should be highly associated with brain development and regulation, we then investigated whether these top 100 significant genes are associated with these functions (Gene level analysis, Figure 7). Finally, we looked at the “extra/unique” patterns each method returns and looked at whether these extra genes are associated the interested functions (Figure 8 & 9). If the extra genes return by a method are more likely to have brain functions, this method might be preferred for comparing the organoid cells as it can provide extra/unique information.
With the threshold FDR < 0.05, there are 313 significant results for eisaR, 160 for BRIE2 and 8075 for DEXSeq_USA.
# packages
#library(UpSetR)
# loading result file
load("comparsion/DF.RData")
# UpSet
thd = 0.05
DF_01 = data.frame(eisaR = ifelse(DF$eisaR < thd, 1, 0),
BRIE2 = ifelse(DF$BRIE2 < thd, 1, 0),
DEXSeq_USA = ifelse(DF$DEXSeq_USA < thd, 1, 0))
upset(DF_01)
Figure 6. UpSet plot on the significant results of eisaR, BRIE2 and DEXSeq.
Here we extracted the top 100 significant genes for each cell type by different methods.
cell_types = unique(DF$Cell_type); cell_types
## [1] "CPNs" "Cycling" "Immature CPNs" "Immature PNs"
## [5] "oRG" "RG"
TOP_100 = list()
xx = 100
for(i in 1:length(cell_types)){
cell = cell_types[i]
tmp = DF[DF$Cell_type == cell,]
# check top 100 genes from each method:
TOP_100[[i]] = data.frame(Cell_type = cell,
eisaR = (tmp$Gene_id[order(tmp$eisaR)])[1:xx],
BRIE2 = tmp$Gene_id[order(tmp$BRIE2)][1:xx],
DEXSeq_USA = tmp$Gene_id[order(tmp$DEXSeq_USA)][1:xx])
}
TOP_100 = do.call(rbind, TOP_100)
# top 100 genes from each method, in every cell-type
head(TOP_100)
## Cell_type eisaR BRIE2 DEXSeq_USA
## 1 CPNs ENSG00000148204 ENSG00000229117 ENSG00000002586
## 2 CPNs ENSG00000112902 ENSG00000164434 ENSG00000006451
## 3 CPNs ENSG00000113732 ENSG00000225783 ENSG00000007168
## 4 CPNs ENSG00000132329 ENSG00000143321 ENSG00000007944
## 5 CPNs ENSG00000137824 ENSG00000125821 ENSG00000021300
## 6 CPNs ENSG00000140632 ENSG00000114744 ENSG00000050344
# All significant genes
all_genes = unique(unlist(TOP_100[,-1]))
#length(all_genes)
#1178
We are interested whether individual genes are associated with brain development or brain regulation functions. By searching for lists of brain, regulation and development on The Human Protein Atlas, we can easily determine significant genes that play a role in these functions. Here we use functional genes to refer genes that have the interested functions.
# loading the JSON file from https://www.proteinatlas.org
# These json files are the lists of genes associated with brain, development and regulation
# e.g. 'development_brain' is the genes are associated with brain development
# and 'development' is the genes are associated with development
#library('rjson')
development <- fromJSON(file='comparsion/development.json')
development_brain <- fromJSON(file='comparsion/development_brain.json')
brain <- fromJSON(file='comparsion/brain.json')
regulation <- fromJSON(file='comparsion/regulation.json')
regulation_brain <- fromJSON(file='comparsion/regulation_brain.json')
# Extract functional genes in the json files
development_genes <- c()
development_brain_genes <- c()
brain_genes <- c()
regulation_genes <- c()
regulation_brain_genes <- c()
for(i in 1:length(development)){
development_genes <- append(development_genes,development[[i]][[1]])
}
for(i in 1:length(development_brain)){
development_brain_genes <- append(development_brain_genes,development_brain[[i]][[1]])
}
for(i in 1:length(brain)){
brain_genes <- append(brain_genes,brain[[i]][[1]])
}
for(i in 1:length(regulation)){
regulation_genes <- append(regulation_genes,regulation[[i]][[1]])
}
for(i in 1:length(regulation_brain)){
regulation_brain_genes <- append(regulation_brain_genes,regulation_brain[[i]][[1]])
}
# Looking for intersection of the top 100 genes in each method and functional genes
top_100_in_development <- list()
top_100_in_development_brain <- list()
top_100_in_brain <- list()
top_100_in_regulation <- list()
top_100_in_regulation_brain <- list()
methods = c("eisaR", "BRIE2", "DEXSeq_USA")
for(i in 1:length(cell_types)){
cell = cell_types[i]
TOP_100_cell = TOP_100[TOP_100$Cell_type==cell,]
for(j in 1:length(methods)){
method = methods[j]
DE_genes = TOP_100_cell[, colnames(TOP_100_cell) == method]
name = paste0(cell,'_',method)
top_100_in_development[[name]] = DE_genes[DE_genes %in% development_genes]
top_100_in_development_brain[[name]] = DE_genes[DE_genes %in% development_brain_genes]
top_100_in_brain[[name]] = DE_genes[DE_genes %in% brain_genes]
top_100_in_regulation[[name]] = DE_genes[DE_genes %in% regulation_genes]
top_100_in_regulation_brain[[name]] = DE_genes[DE_genes %in% regulation_brain_genes]
}
}
The number of significant genes that have brain, development and regulation are summarized in a table and visualized in a bar chart colored with methods. We observed a similar number of functional genes in the top 100 significant genes of three methods.
TOP_100_summary = list(development = top_100_in_development,
development_brain = top_100_in_development_brain,
brain = top_100_in_brain,
regulation = top_100_in_regulation,
regulation_brain = top_100_in_regulation_brain
)
RES = data.frame(matrix(NA, nrow = length(methods), ncol = length(TOP_100_summary)))
for(j in 1:length(TOP_100_summary)){
top = TOP_100_summary[[j]]
for(i in 1:length(methods)){
RES[i,j] = sum(sapply(top[ grep( methods[i], names(top)) ], length))
}
}
colnames(RES) = names(TOP_100_summary)
rownames(RES) = methods
# significant genes that exists in any function gene lists: genes in either development, brain or regulation
RES$any <- NA
for(i in 1:length(methods)){
any_genes <- c()
regulation <- unique(unlist(top_100_in_regulation[grep( methods[i], names(top_100_in_regulation))]))
development <- unique(unlist(top_100_in_development[grep( methods[i], names(top_100_in_development))]))
brain <- unique(unlist(top_100_in_brain[grep( methods[i], names(top_100_in_brain))]))
any_genes <- append(any_genes, brain)
any_genes <- append(any_genes, development)
any_genes <- append(any_genes, regulation)
RES[methods[i],'any'] <- length(unique(any_genes))
}
RES
## development development_brain brain regulation regulation_brain any
## eisaR 173 44 77 382 59 347
## BRIE2 153 32 50 363 40 371
## DEXSeq_USA 191 45 74 387 54 316
## Barplot
#library(ggplot2)
#library(reshape2)
RES$row.names <- rownames(RES)
plot_df <- melt(RES, id = c('row.names'))
ggplot(data = plot_df, aes(x=variable,y=value,fill = row.names)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle('Figure 7. The number of functional genes in the top 100 genes of each method for all six cell types.')
DEX_Seq obtains a much higher number of extra genes that have the interested functions than eisaR and BRIE2. However, the number of extra genes returned by each method has a large difference: 7727 for DEXseq, 125 for BRIE2 and 0 for eisaR. This indicated that it is unfair to compare the number of extra functional genes due to the large difference. Therefore, we also compared the proportion of the functional genes in the total extra genes.
extra_genes_sum_df <- data.frame(matrix(nrow = 3, ncol = 5))
colnames(extra_genes_sum_df) <- c('brain', 'brain_development', 'development', 'brain_regulation', 'regulation')
rownames(extra_genes_sum_df) <- c('DEXSeq_USA', 'eisaR', 'BRIE2')
#extra genes in each method
thrd = 0.05
sel_DEXSeq = (DF$DEXSeq < thrd) & (DF$eisaR > thrd) & (DF$BRIE2 > thrd)
sel_eisaR = (DF$DEXSeq > thrd) & (DF$eisaR < thrd) & (DF$BRIE2 > thrd)
sel_BRIE2 = (DF$DEXSeq > thrd) & (DF$eisaR > thrd) & (DF$BRIE2 < thrd)
sel <- list('DEXSeq_USA' = sel_DEXSeq, 'eisaR' = sel_eisaR, 'BRIE2' = sel_BRIE2)
for(i in 1:length(rownames(extra_genes_sum_df))){
name = rownames(extra_genes_sum_df)[i]
select = sel[[name]]
extra_genes = DF[select,]$Gene_id
extra_genes_sum_df[name,'brain'] <- sum(as.integer(extra_genes %in%
unique(unlist(brain_genes))))
extra_genes_sum_df[name,'brain_development'] <- sum(as.integer(extra_genes %in% unique(unlist(development_brain_genes))))
extra_genes_sum_df[name,'development'] <- sum(as.integer(extra_genes %in%
unique(unlist(development_genes))))
extra_genes_sum_df[name,'brain_regulation'] <- sum(as.integer(extra_genes %in% unique(unlist(regulation_brain_genes))))
extra_genes_sum_df[name,'regulation'] <- sum(as.integer(extra_genes %in%
unique(unlist(regulation_genes))))
}
extra_genes_sum_df
## brain brain_development development brain_regulation regulation
## DEXSeq_USA 670 418 1885 529 4680
## eisaR 0 0 0 0 0
## BRIE2 9 4 25 7 71
# Bar plot
extra_genes_sum_df$row.names <- rownames(extra_genes_sum_df)
plot_df1 <- melt(extra_genes_sum_df, id = c('row.names'))
ggplot(data = plot_df1, aes(x=variable,y=value,fill = row.names)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle('Figure 8. The number of extra genes each method returns in different functions')
DEXSeq and BRIE2 show a similar proportion of the functional genes in their extra genes. Some of these proportions have a very low value, e.g. in brain, brain development and brain regulation, which indicates none of these three methods is providing extra information in terms of brain functions.
extra_genes_df <- data.frame(matrix(nrow = 3, ncol = 5))
colnames(extra_genes_df) <- c('brain', 'brain_development', 'development', 'brain_regulation', 'regulation')
rownames(extra_genes_df) <- c('DEXSeq_USA', 'eisaR', 'BRIE2')
#extra genes in
thrd = 0.05
sel_DEXSeq = (DF$DEXSeq < thrd) & (DF$eisaR > thrd) & (DF$BRIE2 > thrd)
sel_eisaR = (DF$DEXSeq > thrd) & (DF$eisaR < thrd) & (DF$BRIE2 > thrd)
sel_BRIE2 = (DF$DEXSeq > thrd) & (DF$eisaR > thrd) & (DF$BRIE2 < thrd)
sel <- list('DEXSeq_USA' = sel_DEXSeq, 'eisaR' = sel_eisaR, 'BRIE2' = sel_BRIE2)
for(i in 1:length(rownames(extra_genes_df))){
name = rownames(extra_genes_df)[i]
select = sel[[name]]
extra_genes = DF[select,]$Gene_id
extra_genes_df[name,'brain'] <- mean(as.integer(extra_genes %in%
unique(unlist(brain_genes))))
extra_genes_df[name,'brain_development'] <- mean(as.integer(extra_genes %in% unique(unlist(development_brain_genes))))
extra_genes_df[name,'development'] <- mean(as.integer(extra_genes %in%
unique(unlist(development_genes))))
extra_genes_df[name,'brain_regulation'] <- mean(as.integer(extra_genes %in% unique(unlist(regulation_brain_genes))))
extra_genes_df[name,'regulation'] <- mean(as.integer(extra_genes %in%
unique(unlist(regulation_genes))))
}
extra_genes_df[2,] <- 0
extra_genes_df
## brain brain_development development brain_regulation regulation
## DEXSeq_USA 0.08670894 0.05409603 0.2439498 0.06846124 0.6056684
## eisaR 0.00000000 0.00000000 0.0000000 0.00000000 0.0000000
## BRIE2 0.07200000 0.03200000 0.2000000 0.05600000 0.5680000
# Bar plot
extra_genes_df$row.names <- rownames(extra_genes_df)
plot_df2 <- melt(extra_genes_df, id = c('row.names'))
ggplot(data = plot_df2, aes(x=variable,y=value,fill = row.names)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle('Figure 9. The proportion of extra genes each method returns in different functions')
The goal of the project was to explore the difference between 3- and 6-month organoid cells and to compare the differential results of three methods (eisaR, BRIE2 and DEXSeq). We first observed distinct UMAP clustering and significant changes in cellular composition between 3- and 6-month organoids. Furthermore, the RNA velocity illustrated similar trends between biological replicates in the same condition but a great dissimilarity between conditions. These results implied differential regulation if the collection time differs. Therefore, we performed differential analyses on the organoid cells with eisaR, BRIE2 and DEXSeq. Several differential genes returned by each approach have relation with brain function and/or brain development. In addition, there is no significant difference between the information returned by the three methods in terms of the brain function. This is under expectation as all methods target the same from the same data. However, some differences are still visible: 1. DEXSeq provides many more significant genes, while eisaR is the most conservative. 2. DEXSeq show more agreement with potentially reasonable biological prior knowledge in development and regulation.
Clearly, the choice of the threshold largely decide the number of significant results, e.g. 313 significant genes returned by eisaR when FDR < 0.05, but only 2 significant genes when FDR < 0.01. Each method should have its own optimal threshold, which could be detected by statistical estimation. We might set a lower threshold for DEXSeq than eisaR and BRIE2 to balance their number of significant results. In addition, we could compare the differential analysis methods using simulated data so that the ground truth is known and the comparison will be informative.
I am grateful to Dr. Simone Tiberi for his idea for this paper and his excellent supervision. I would also like to thank Prof. Dr. Mark Robinson for his insightful discussion.
Sun, N., Meng, X., Liu, Y. et al. Applications of brain organoids in neurodevelopment and neurological diseases. J Biomed Sci 28, 30 (2021). doi: https://doi.org/10.1186/s12929-021-00728-4
Velasco, S., et al. (2019) ‘Individual brain organoids reproducibly form cell diversity of the human cerebral cortex’, Nature, 570: 523-527. doi: https://doi.org/10.1038/s41586-019-1289-x
Lancaster MA, Knoblich JA (2014). ‘Generation of cerebral organoids from human pluripotent stem cells.’ Nat Protoc. 9(10):2329–40.
Manno, G.L. et al. (2018) ‘RNA velocity of single cells’ Nature, 560: 494-498. doi: https://doi.org/10.1038/s41586-018-0414-6
Bergen, V., Lange, M., Peidli, S. et al. (2020) Generalizing RNA velocity to transient cell states through dynamical modeling. Nat Biotechnol 38, 1408–1414. doi: https://doi.org/10.1038/s41587-020-0591-3